Remote Elicitation of Inflectional Paradigms to Seed Morphological Analysis in Low-Resource Languages
Abstract
Structured, complete inflectional paradigm data exists for very few of the world’s languages, but is crucial to training morphological analysis tools. We present methods inspired by linguistic fieldwork for gathering inflectional paradigm data in a machine-readable, interoperable format from remotely located speakers of any language. Informants are tasked with completing language-specific paradigm elicitation templates. Templates are constructed by linguists using grammatical reference materials to ensure completeness. Each cell in a template is associated with contextual prompts designed to help informants with varying levels of linguistic expertise (from professional translators to untrained native speakers) provide the desired inflected form. To facilitate downstream use in interoperable NLP/HLT applications, each cell is also associated with a language-independent, machine-readable set of morphological tags from the UniMorph Schema. These data are useful for seeding morphological analysis and generation software, particularly when they are representative of the range of surface morphological variation in the language. At present, we have obtained 792 lemmas and 25,056 inflected forms from 15 languages.
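The template structure described in the abstract can be pictured with a minimal sketch. This is an illustration only, not the authors' tooling: the class names `Cell` and `Template`, the Spanish prompts, and the helper methods are hypothetical. The tag strings follow the semicolon-separated UniMorph feature style, and the export assumes a tab-separated lemma, form, tags layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cell:
    """One paradigm cell (hypothetical structure, for illustration)."""
    tags: str                   # UniMorph-style feature bundle, e.g. "V;IND;PRS;3;SG"
    prompt: str                 # contextual prompt shown to the informant
    form: Optional[str] = None  # inflected form supplied by the informant


@dataclass
class Template:
    """A paradigm elicitation template for a single lemma (hypothetical)."""
    lemma: str
    cells: List[Cell] = field(default_factory=list)

    def missing(self) -> List[str]:
        # Tag bundles whose cells the informant has not filled in yet.
        return [c.tags for c in self.cells if not c.form]

    def to_tsv(self) -> str:
        # One "lemma<TAB>form<TAB>tags" line per completed cell.
        return "\n".join(
            f"{self.lemma}\t{c.form}\t{c.tags}" for c in self.cells if c.form
        )


# Assumed example: two present-indicative cells for Spanish "hablar".
template = Template(
    lemma="hablar",
    cells=[
        Cell(tags="V;IND;PRS;1;SG", prompt="Yo ___ español."),
        Cell(tags="V;IND;PRS;3;SG", prompt="Ella ___ español."),
    ],
)
template.cells[0].form = "hablo"  # the informant completes the first cell

print(template.missing())
print(template.to_tsv())
```

Keeping the prompt and the tag bundle on the same cell means the informant only ever sees the prompt, while downstream tools only need the tags and the collected form.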
Similar resources
Resource-Light Acquisition of Inflectional Paradigms
This paper presents a resource-light acquisition of morphological paradigms and lexicon for fusional languages. It builds upon Paramor [10], an unsupervised system, by extending it: (1) to accept a small seed of manually provided word inflections with marked morpheme boundary; (2) to handle basic allomorphic changes acquiring the rules from the seed and/or from previously acquired paradigms. Th...
Evaluation of Finite State Morphological Analyzers Based on Paradigm Extraction from Wiktionary
Wiktionary provides lexical information for an increasing number of languages, including morphological inflection tables. It is a good resource for automatically learning rule-based analysis of the inflectional morphology of a language. This paper performs an extensive evaluation of a method to extract generalized paradigms from morphological inflection tables, which can be converted to weighte...
A Paradigm-Based Finite State Morphological Analyzer for Marathi
A morphological analyzer forms the foundation for many NLP applications of Indian Languages. In this paper, we propose and evaluate the morphological analyzer for Marathi, an inflectional language. The morphological analyzer exploits the efficiency and flexibility offered by finite state machines in modeling the morphotactics while using the well devised system of paradigms to handle the stem a...
Very-large Scale Parsing and Normalization of Wiktionary Morphological Paradigms
Wiktionary is a large-scale resource for cross-lingual lexical information with great potential utility for machine translation (MT) and many other NLP tasks, especially automatic morphological analysis and generation. However, it is designed primarily for human viewing rather than machine readability, and presents numerous challenges for generalized parsing and extraction due to a lack of stan...
Automatic Construction of Morphologically Motivated Translation Models for Highly Inflected, Low-Resource Languages
Statistical Machine Translation (SMT) of highly inflected, low-resource languages suffers from the problem of low bitext availability, which is exacerbated by large inflectional paradigms. When translating into English, rich source inflections have a high chance of being poorly estimated or out-of-vocabulary (OOV). We present a source language-agnostic system for automatically constructing phra...
Publication date: 2016